
add python accelerator HAL for multi-vendor backends#86

Closed
theap06 wants to merge 24 commits into facebookresearch:main from theap06:feat/accelerator-hal-upstream-clean

Conversation

@theap06
Contributor

@theap06 theap06 commented Mar 3, 2026

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.

Adds a Python-first hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
Implements a functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.

Implements Feature Request #74
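
As context for reviewers, the common backend interface described above can be sketched roughly as below. The names AcceleratorBackend and DeviceHandle mirror those discussed in the review, but the exact fields and method signatures here are illustrative assumptions, not the PR's actual definitions:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class DeviceHandle:
    """Illustrative device handle; the PR's real fields may differ."""
    index: int
    model: str


@dataclass
class DeviceMetrics:
    """Normalized metrics shared across vendors (a subset, for illustration)."""
    gpu_util: float
    mem_util: float
    temperature: float
    power_draw: float


class AcceleratorBackend(Protocol):
    """Vendor-neutral contract a backend (NVML, ROCm, ...) would satisfy."""
    def probe(self) -> bool: ...
    def enumerate_devices(self) -> list[DeviceHandle]: ...
    def read_metrics(self, device: DeviceHandle) -> DeviceMetrics: ...
    def close(self) -> None: ...


class FakeBackend:
    """Minimal in-memory backend used here only to exercise the interface."""
    def probe(self) -> bool:
        return True

    def enumerate_devices(self) -> list[DeviceHandle]:
        return [DeviceHandle(index=0, model="Fake GPU")]

    def read_metrics(self, device: DeviceHandle) -> DeviceMetrics:
        return DeviceMetrics(gpu_util=0.0, mem_util=0.0,
                             temperature=45.0, power_draw=25.0)

    def close(self) -> None:
        pass
```

Because the interface is a Protocol, FakeBackend satisfies it without inheriting from it; any vendor backend that implements the four methods is interchangeable.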

Test Plan:

Ran HAL tests:
12 passed

pytest -q gcm/tests/test_accelerator_hal.py 

gcm --backend=nvml nvml_monitor --sink=stdout --once --log-folder=/tmp/gcm-log

[{"gpu_id": 0, "hostname": "129-159-36-52", "mem_util": 0, "gpu_util": 0, "mem_used_percent": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0, "job_id": null, "job_user": null, ...}]
[{"gpu_index": 0, "mem_util": 0, "mem_used_percent": 0, "gpu_util": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}, {"max_gpu_util": 0.0, "min_gpu_util": 0.0, "avg_gpu_util": 0.0, "ram_util": 0.42}]

echo $? 
Output: 0

health_checks --backend=nvml check-nvidia-smi fair_cluster nagios --sink=stdout -c gpu_num --gpu_num=0

[{"node": "129-159-36-52", "cluster": "fair_cluster", "derived_cluster": "fair_cluster", "health_check": "nvidia smi", "type": "nagios", "result": 0, "_msg": "Number of GPUs present is the same as expected, 0", "job_id": 0, "1772265789.8423011": <float>, "1772265789.8663087": <float>}]

echo $? 
Output: 0

@github-actions

github-actions bot commented Mar 3, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@theap06 theap06 marked this pull request as draft March 3, 2026 06:35
@theap06 theap06 marked this pull request as ready for review March 3, 2026 07:05
@theap06
Contributor Author

theap06 commented Mar 4, 2026

Could I get some feedback? This tackles my feature request (#74) on supporting other hardware.

Member

@luccabb luccabb left a comment


adds scaffold backends for ROCm, LevelZero, TPU, and Neuron

let's just keep NVML since it's the only backend supported in this PR

Comment thread gcm/monitoring/accelerator/backends/levelzero.py Outdated
@theap06 theap06 requested a review from luccabb March 4, 2026 08:33
Member

@luccabb luccabb left a comment


add some integration to existing paths, then run the cmds before and after these changes to prove it's all working as expected:

gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

add the above to the test plan

@theap06 theap06 requested a review from A-Kokolis as a code owner March 4, 2026 09:00
@theap06 theap06 requested a review from luccabb March 4, 2026 09:01
@luccabb
Member

luccabb commented Mar 4, 2026

gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

this should be a full run of the cmd with the corresponding output for each

Comment thread gcm/tests/data/sample-sacct-fake-line-break-expected.txt
@theap06 theap06 requested a review from luccabb March 4, 2026 19:12
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 44b06c0 to 49fa293 Compare March 4, 2026 19:25
Member

@luccabb luccabb left a comment


gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

this should be a full run of the cmd with the corresponding output for each

@theap06 theap06 requested a review from luccabb March 4, 2026 22:26
@theap06
Contributor Author

theap06 commented Mar 8, 2026

@luccabb the hardware backend works and the functionality works on my end with a GPU instance. Let me know if you have any other feedback.

Member

@luccabb luccabb left a comment


@theap06 I meant you should add the real CLI call and outputs from it in the test plan from inside your env.

similar to what you did for #61

@theap06
Contributor Author

theap06 commented Mar 9, 2026

@theap06 I meant you should add the real CLI call and outputs from it in the test plan from inside your env.

similar to what you did for #61

Done.

@theap06 theap06 requested a review from luccabb March 9, 2026 02:10
@theap06
Contributor Author

theap06 commented Mar 10, 2026

@luccabb could I get some feedback? I added the testing for the HAL in the PR message.

@luccabb
Member

luccabb commented Mar 11, 2026

@claude can you review this pr?

@claude

claude bot commented Mar 11, 2026

Claude Code is working…

I'll analyze this and get back to you.


@luccabb
Member

luccabb commented Mar 11, 2026

@claude can you review this pr?

@claude

claude bot commented Mar 11, 2026

Claude finished @luccabb's task in 1m 52s


PR Review: add python accelerator HAL for multi-vendor backends

  • Gather context and understand request
  • Read CLAUDE.md for repo guidelines
  • Read changed files
  • Provide comprehensive review

Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.


Critical Issues

1. --backend option is accepted but never used

In both gcm/monitoring/cli/gcm.py:43 and gcm/health_checks/cli/health_checks.py:39, the backend parameter is immediately discarded:

def main(detach: bool, backend: str) -> None:
    _ = backend   # ← never forwarded anywhere

The option appears in --help and is validated by Click, but it does nothing. The new AcceleratorManager/NVMLBackend is never instantiated from the CLI path. The nvml_monitor subcommand still goes directly through the old DeviceTelemetryClient. Until this wiring is added, the HAL is not actually used at runtime.
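
For illustration only, the missing wiring might look like the sketch below. BACKEND_REGISTRY and NVMLBackendStub are hypothetical names standing in for the PR's AcceleratorManager/NVMLBackend machinery, whose exact API is not shown here:

```python
from typing import Callable, Dict


class NVMLBackendStub:
    """Stand-in for the PR's NVMLBackend (gcm/monitoring/accelerator)."""
    name = "nvml"


# Hypothetical name-to-factory registry; the PR presumably has its own.
BACKEND_REGISTRY: Dict[str, Callable[[], object]] = {"nvml": NVMLBackendStub}


def main(detach: bool, backend: str) -> object:
    # Previously: `_ = backend` silently discarded the option.
    # Resolve it instead, so the monitoring path actually uses the HAL.
    try:
        factory = BACKEND_REGISTRY[backend]
    except KeyError:
        raise ValueError(f"unsupported backend: {backend!r}")
    return factory()
```

An unknown name now fails loudly instead of being accepted and ignored, which is the behavior the review is asking for.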


2. get_vbios_version() used to populate the model field — semantically wrong

In gcm/monitoring/accelerator/backends/nvml.py:92:

model = client.get_device_by_index(index).get_vbios_version()

get_vbios_version() returns a firmware version string (e.g. "90.00.38.00.09"), not the GPU model name (e.g. "A100 80GB"). The DeviceHandle.model field implies a human-readable hardware model. This should use get_name() or similar if available on the client.


Minor Issues

3. NVMLBackend.close() does not close the underlying client

In gcm/monitoring/accelerator/backends/nvml.py:174-176:

def close(self) -> None:
    self._client = None   # ← just drops the reference; no shutdown call
    return None

If DeviceTelemetryClient holds resources (e.g. an NVML context), this leaks them. close() should call self._client.close() (or equivalent) before clearing the reference.
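
A hedged sketch of the suggested fix; whether DeviceTelemetryClient actually exposes a close() method (versus some other shutdown call) is an assumption here:

```python
class ClientStub:
    """Stand-in for DeviceTelemetryClient; the real shutdown name may differ."""
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True  # e.g. where nvmlShutdown() would be invoked


class NVMLBackendSketch:
    def __init__(self, client: ClientStub) -> None:
        self._client = client

    def close(self) -> None:
        # Shut the client down BEFORE dropping the reference, so the
        # underlying NVML resources are released rather than leaked.
        # Safe to call more than once.
        if self._client is not None:
            self._client.close()
            self._client = None
```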

4. BackendOperationError stores cause redundantly

In gcm/monitoring/accelerator/errors.py:20-27, cause: Exception is stored as a dataclass field and set as the chained exception via raise ... from e at the call sites. This means the exception appears twice. The __str__ method also manually formats it. Consider dropping the explicit cause field and relying solely on Python's built-in exception chaining (__cause__).
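
The built-in chaining this suggestion relies on can be shown in isolation; the snippet below is standard Python behavior, not the PR's actual errors.py:

```python
class BackendOperationError(Exception):
    """No explicit `cause` field: `raise ... from e` records it as __cause__."""


def read_metrics() -> None:
    try:
        raise OSError("NVML library not loaded")
    except OSError as e:
        # Chaining stores the original exception exactly once, on __cause__,
        # and tracebacks render the "The above exception was the direct
        # cause..." banner automatically -- no manual __str__ formatting.
        raise BackendOperationError("read_metrics failed") from e


try:
    read_metrics()
except BackendOperationError as err:
    assert isinstance(err.__cause__, OSError)
```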

5. NVMLBackend subclasses AcceleratorBackend (a Protocol)

In gcm/monitoring/accelerator/backends/nvml.py:51:

@dataclass
class NVMLBackend(AcceleratorBackend):

AcceleratorBackend is defined as a Protocol. You don't need to explicitly inherit from a Protocol to satisfy it — you just need to implement the methods. Subclassing a Protocol creates a coupling that isn't needed and is non-idiomatic. NVMLBackend should just be a plain class (or dataclass) that happens to implement all the AcceleratorBackend methods.
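
The structural-typing point can be demonstrated with a toy Protocol (names here are illustrative, not from the PR):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Probeable(Protocol):
    def probe(self) -> bool: ...


class PlainBackend:
    """Does NOT inherit from Probeable, yet satisfies it structurally."""
    def probe(self) -> bool:
        return True


backend = PlainBackend()
# A runtime_checkable Protocol supports isinstance() via duck typing,
# so explicit inheritance buys nothing and only adds coupling.
assert isinstance(backend, Probeable)
```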


Observations / Non-blocking

  • MetricRequest.include_process_info is ignored in read_metrics: The _request parameter in NVMLBackend.read_metrics is never read. If process info collection is planned, a TODO comment would clarify intent.
  • probe() is called before enumerate_devices() in tests but not enforced: Nothing stops a caller from calling enumerate_devices() on an unprobed backend (the client will be lazily initialized, which may raise). A guard or documented contract would help.
  • Test coverage is good — the 12 HAL unit tests cover normal operation, partial failures, missing library, fallback paths, manager lifecycle, and registry. The new integration tests in test_gcm.py gracefully handle no-GPU environments.
  • README is clear and well-written — the lifecycle section and backend authoring guide are helpful.

Summary

The HAL design is solid and the tests are thorough. The main issue is that the --backend flag is not wired to the HAL — the new code is tested in isolation but not yet integrated into the runtime monitoring path, which undermines the core goal of the PR. The model field bug and close() resource leak should also be fixed. Happy to see this land once those are addressed.

Comment thread .github/workflows/gcm_python.yml Outdated
Comment thread gcm/exporters/telemetry.py Outdated
@theap06 theap06 requested a review from luccabb March 18, 2026 04:08
@theap06
Contributor Author

theap06 commented Mar 18, 2026

@claude review my pr

Comment thread .github/workflows/gcm_python.yml
Comment thread .github/workflows/gcm_python.yml
@luccabb
Member

luccabb commented Mar 18, 2026

@claude

@claude

claude bot commented Mar 18, 2026

Claude finished @luccabb's task in 1m 8s


PR Re-Review: add python accelerator HAL for multi-vendor backends

  • Gather context
  • Read changed files and CI diff
  • Provide updated review

One Issue Blocking Merge

.github/workflows/gcm_python.yml deletes the Rust installation step that exists on main

The PR removes these 5 lines from main (around line 67):

      - name: Install Rust
        run: |
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
          cp -r $HOME/.cargo/bin/* $HOME/.cache/venv-ci/bin/

This step was added to main independently to fix a build dependency that requires Rust. It has nothing to do with this PR. Removing it here means merging this PR would silently break the Rust-dependent build on main.

The author's explanation ("reverted it since I removed the related code from this PR") suggests they added this step in their branch and are now reverting their own addition — but this step was already on main before this PR branched off. The diff is subtractive relative to main, not just undoing a change introduced in this branch.

Fix: Restore those 5 lines so the workflow file is identical to main. This PR should not touch .github/workflows/gcm_python.yml at all.


Everything else from all prior reviews remains fixed. This is the only remaining change needed before merge.

@theap06 theap06 requested a review from luccabb March 18, 2026 16:50
@theap06
Contributor Author

theap06 commented Mar 18, 2026

@luccabb added back the Rust dep

Comment thread website/docs/GCM_Monitoring/accelerator_hal.md
Comment thread gcm/accelerator/__init__.py
Comment thread dev-requirements.txt Outdated
Comment thread gcm/monitoring/accelerator/backends/nvml.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 8694aef to 5e0fbb9 Compare March 19, 2026 06:21
@theap06 theap06 requested a review from luccabb March 19, 2026 06:22
Member

@luccabb luccabb left a comment


you want this PR to be in a state where (1) ALL existing NVML paths from monitoring and health checks go through the new accelerator interface and behavior is the same as before the PR. After (1) is accomplished, add a test plan that exercises every code path that was updated, with before and after sections to prove that the functionality is the same.

Comment thread gcm/accelerator/metrics.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 5e0fbb9 to d694a52 Compare March 20, 2026 05:38
@theap06 theap06 requested a review from luccabb March 20, 2026 05:39
Copy link
Copy Markdown
Member

@luccabb luccabb left a comment


please address #86 (review)

@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from d694a52 to 3989d0d Compare March 20, 2026 21:35
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 3989d0d to ccc9129 Compare March 20, 2026 21:37
@theap06
Contributor Author

theap06 commented Mar 20, 2026

please address #86 (review)

addressed

@theap06 theap06 requested a review from luccabb March 20, 2026 23:06
@theap06
Contributor Author

theap06 commented Mar 25, 2026

@luccabb could I get some feedback on this PR?

@theap06
Contributor Author

theap06 commented Apr 1, 2026

@luccabb it should be ready to merge

@theap06
Contributor Author

theap06 commented Apr 5, 2026

@luccabb will close this pr.

@theap06 theap06 closed this Apr 5, 2026